class: center, middle, inverse, title-slide

# Statistics and such
## Learning from data
### Brendan Alexander
### 2020/7/28 (updated: 2020-08-19)

---
background-image: url(https://upload.wikimedia.org/wikipedia/commons/6/69/EM_Clustering_of_Old_Faithful_data.gif)

<style type="text/css">
.huge .remark-code { /*Change made here*/
  font-size: 200% !important;
}
.tiny .remark-code { /*Change made here*/
  font-size: 70% !important;
}
</style>

Image credit: [Wikimedia Commons](https://upload.wikimedia.org/wikipedia/commons/6/69/EM_Clustering_of_Old_Faithful_data.gif)

- The EM algorithm at work
- Definitely not classical statistics
- Statistics, machine learning, and computer science have come together to create interesting solutions to data problems

---

# Who am I?

.pull-left[
- I work with population models of growth and inheritance.
- Herbicide resistance/herbicide stewardship in waterhemp
- Theoretical and empirical approaches

MS Plant Science
- (University of Alberta, November 2016)

MS Applied Statistics
- (University of Illinois Urbana-Champaign, August 2020)

PhD Crop Science
- not quite, but soon I hope...
]

.pull-right[

]

---

# What is statistics?

<!-- - Inverse probability (mostly a defunct definition, but I really like it) -->
<!-- - Consider a 6 sided die. If it's a fair die then we know that the probability of rolling a **1** is `\(\frac{1}{6}\)` -->
<!-- - But what if instead I tell you that I roll a die 4 times and the results are: 1, 1, 5, 20. What's the most likely type of die that I rolled? (What's the data-generating model?)
-->

- Statistics is all about learning from data
- There are two main ways to learn: Inference and Prediction
- There are *many* different ways to view statistics, but I think this is a pretty reasonable way to categorize things

.pull-left[
**Inference:** Learning about mechanisms and relationships
]

.pull-right[
**Prediction:** Using massive quantities of data to predict events (elections, sports, how many hamburgers to prepare ahead of a lunch rush)
]

- The boundaries between these concepts are not black and white; this is just a general guide

---

# Inference

- Inference (Classical statistics, Frequentist, Likelihood-ist, Bayesian)
- Exploratory: graphing, finding patterns, hypothesizing
- Confirmatory: model building, hypothesis tests
- I classify an analysis as inference when we want to learn about real-world distributions, relationships, or mechanisms.
- This is where most agricultural data goes.
- These are usually parametric problems (parametric defined later)
- Graphing and model estimation
- Linear regression (fitting a line)
- ANOVAs, ANCOVAs (still just linear regression, but now with fancy names)
- Generalized models (logistic regression, Poisson/beta regression)
- Mixed models
- Non-linear models
- Probability distributions

---

# Parametric and non-parametric statistics

**Parametric statistics**

- We have an idea of what the probability distributions are, or we're at least willing to make some assumptions.
- This is most of the statistics that we do in biology and agriculture, but there are important exceptions, especially with machine learning and prediction
- Linear and non-linear models/Mixed models/Generalized additive models (at least the error distribution)
- Multivariate methods like PCA and factor analysis

**Non-parametric statistics**

- We don't know which probability distributions our data represent, and we're not willing to make assumptions
- Rank methods/Kernel methods/Random forest
- Machine learning methods tend to be non-parametric because the data sets are so large and have so many variables (features) that distributional assumptions can't easily be checked
- Machine learning methods also tend to have prediction as the goal, not inference
- In biology we tend to want inference first

Both parametric and non-parametric methods can be used for inference, but parametric methods are more powerful (at the cost of making assumptions)

---

.pull-left[
# Exploratory analysis

*Hypothesis generating*

- This is an approach to data analysis that focuses on non-parametric methods of interpreting data
- Non-parametric really just refers to methods that don't use the data to estimate the parameters of probability distributions or models (which we'll get to in a bit)
- We're looking for trends, relationships, and/or patterns that we might like to investigate further
- Graphing data is a major exploratory method
]

.pull-right[
**`mtcars` data**

```r
pairs(mtcars[,1:5])
```

<!-- -->
]

---

# Confirmatory analysis

*Hypothesis testing*

.pull-left[
- Confirmatory analysis is the testing or evaluation of possible hypotheses/models using data.
- These analyses are typically parametric, but not always.
- These analyses almost always try to conduct hypothesis tests (not always correctly)

$$
`\begin{aligned}
H_0&: \beta_1=0\\
H_A&: \beta_1\neq0
\end{aligned}`
$$

```r
fit <- lm(hp ~ cyl, mtcars)
confint(fit)
```

```
##                  2.5 %       97.5 %
## (Intercept) -102.0743  -0.03442479
## cyl           24.0265  39.89006527
```
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-4-1.png" width="90%" />
]

---

# Lots of other output, what does it all mean?

- There's way more output here than just the two parameter estimates for a linear function.
- We aren't dealing with functions; we're dealing with distributions

.tiny[

```r
summary(fit)
```

```
## 
## Call:
## lm(formula = hp ~ cyl, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54.61 -25.99 -11.28  21.51 130.39 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -51.054     24.982  -2.044   0.0499 *  
## cyl           31.958      3.884   8.229 3.48e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.62 on 30 degrees of freedom
## Multiple R-squared:  0.693, Adjusted R-squared:  0.6827
## F-statistic: 67.71 on 1 and 30 DF,  p-value: 3.478e-09
```
]

---

class: center

# Functions vs. distributions

.pull-left[
**Functions**

No stochastic (random) component

$$
`\begin{aligned}
y&=\underbrace{\beta_0 +\beta_1x}_{deterministic}\\
\\
y&=2+x
\end{aligned}`
$$

<img src="index_files/figure-html/unnamed-chunk-6-1.png" width="50%" />
]

.pull-right[
**Distributions**

$$
`\begin{aligned}
y&=\underbrace{\beta_0 +\beta_1x}_{deterministic} + \underbrace{\epsilon}_{stochastic}\\
\\
y&=2+x+\epsilon, \:\:\:\: \epsilon\sim \mathcal{N}(0,25)\\
\end{aligned}`
$$

<img src="index_files/figure-html/unnamed-chunk-7-1.gif" width="50%" />
]

---
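# Simulating a function vs. a distribution

The distinction on the previous slide is easy to see in code. A minimal sketch (the `x` grid and the seed are my own choices, not from the slides; `sd = 5` matches the `\(\mathcal{N}(0,25)\)` example, since the variance 25 means a standard deviation of 5):

```r
set.seed(1)
x <- seq(0, 10, length.out = 50)

# Function: purely deterministic -- the same x always gives the same y
y_fun  <- 2 + x

# Distribution: deterministic part plus a stochastic error term
y_dist <- 2 + x + rnorm(n = 50, mean = 0, sd = 5)
```

Re-running the last line produces a different `y_dist` every time (unless the seed is reset), while `y_fun` never changes.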
---

# Simulate our own model

Let's create some linear regression data from the ground up.

We need:

1. A functional relationship between two variables.
2. A probability distribution to describe error. We'll use the normal distribution to start.

`$$y_i = f(x_i)$$`

`$$\epsilon_i \sim \mathcal{N}(0,\sigma^2)$$`

The true statistical model is:

`$$y_i = f(x_i) + \epsilon_i$$`

or, for simple linear regression,

`$$\underbrace{y_i}_{DV} = \underbrace{\beta_0}_{Intercept} + \underbrace{\beta_1}_{slope}\underbrace{x_i}_{IV}+\underbrace{\epsilon_i}_{error}$$`

where DV is dependent variable, and IV is independent variable.

---

## Create a functional relationship

.pull-left[
`$$y_i=2+2x_i$$`
]

.pull-right[
<div class="figure">
<img src="index_files/figure-html/unnamed-chunk-10-1.png" alt="A beautiful plot, by me. This is the plot [p1]. Notice that this is not a statistical relationship, there is no error." width="70%" />
<p class="caption">A beautiful plot, by me. This is the plot [p1]. Notice that this is not a statistical relationship, there is no error.</p>
</div>
]

---

## Generate and plot the error

.pull-left[

```r
set.seed(90210)
e <- data.frame(error = rnorm(n = 50, mean = 0, sd = 10))
```

Looks alright. Keep this in mind whenever you check the assumption of normality for real data: it's not always pretty, but sometimes it's good enough.
]

.pull-right[
<!-- -->
]

---

## Create the statistical relationship

.pull-left[
OK! Let's combine these components to simulate some regression data.

.tiny[

```r
summary(lm(yvar ~ xvar, df2))
```

```
## 
## Call:
## lm(formula = yvar ~ xvar, data = df2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.0013  -5.5597   0.9162   5.4365  20.6247 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.645      2.958   0.894    0.376    
## xvar           2.002      0.101  19.829   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1
## 
## Residual standard error: 10.3 on 48 degrees of freedom
## Multiple R-squared:  0.8912, Adjusted R-squared:  0.8889
## F-statistic: 393.2 on 1 and 48 DF,  p-value: < 2.2e-16
```
]
]

.pull-right[
<!-- -->
]

---

# Ordinary least squares (OLS) vs. Maximum likelihood (ML)

We just used OLS, a very old technique based entirely on linear algebra.

- OLS is just linear algebra and makes minimal assumptions
  - It can be done easily by hand for small data sets
  - This is what people did back in 1930
- OLS produces the best linear unbiased estimates (BLUEs) as long as some assumptions are met:
  - A linear model is reasonable
  - Constant variance
  - Normality isn't technically an assumption for estimation, but it is for producing intervals
- ML is very powerful but also makes a lot of distributional assumptions about the data
- We're looking to maximize the likelihood of seeing our data by testing many different models/parameter estimates

---

# Visualization of a log-likelihood surface (for ML)

We fit fancy, complex models with maximum likelihood. But what is this mathematical voodoo?

`$$\mathcal{L}(parameters|obs)=\prod_{i=1}^nf(obs_i)$$`

The log-likelihood is usually used instead for numerical stability. Products change to sums when a `\(\log\)` function is applied.

`$$\log{\mathcal{L}(parameters|obs)}=\sum_{i=1}^n \log{f(obs_i)}$$`

The goal is to maximize the likelihood (or log-likelihood) to find good parameter estimates.
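For a one-parameter surface, the maximization can be done by brute force. A small sketch with made-up data (the sample, seed, and grid range are my own assumptions; `dnorm(..., log = TRUE)` returns the log density, so summing it over the observations gives the log-likelihood):

```r
set.seed(42)
obs <- rnorm(100, mean = 40, sd = sqrt(10))

# Scan a grid of candidate means, holding the variance fixed at 10
mu_grid <- seq(35, 45, by = 0.01)
loglik  <- sapply(mu_grid, function(mu) {
  sum(dnorm(obs, mean = mu, sd = sqrt(10), log = TRUE))
})

# The grid value with the largest log-likelihood is our estimate
mu_hat <- mu_grid[which.max(loglik)]
```

To within the grid resolution, `mu_hat` lands on `mean(obs)`: for the normal mean, the ML estimate agrees with the sample average.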
---

# An example with the normal distribution

.pull-left[
.tiny[

```r
x <- seq(30, 50, length.out = 100)
hx <- dnorm(x, mean = 40, sd = sqrt(10))
```
]

The log-likelihood of seeing a `\(50\)` and a `\(38\)` when we have a normal distribution with `\(\mu =40\)` and `\(\sigma^2=10\)` is:

.tiny[

```r
log(dnorm(x = 50, mean = 40, sd = sqrt(10))) +
  log(dnorm(x = 38, mean = 40, sd = sqrt(10)))
```

```
## [1] -9.340462
```
]
]

.pull-right[
.tiny[

```r
plot(x, hx, type="l", lty=2,
     xlab="x value", ylab="Density",
     main="Normal distribution")
```

<img src="index_files/figure-html/unnamed-chunk-18-1.png" width="70%" />
]
]

What we want to do is explore a wide range of possible parameter values (the parameter surface). We'll choose the parameter values with the maximum likelihood (log-likelihood).

---

# Let's create some non-normal data!

.pull-left[
.tiny[

```r
set.seed(90210)
lldata <- data.frame(xvar=as.numeric(seq(0,5,0.1)))
lldata2 <- lldata %>%
  mutate(lam=50*exp(-.9*xvar)) %>%
  mutate(yvar=rpois(n = nrow(lldata), lambda = lam))
```
]

Here's our equation:

`$$\lambda_i=50\cdot e^{-0.9 \cdot x_i}$$`

`$$y_i \sim Poisson(\lambda_i)$$`
]

.pull-right[
.tiny[

```r
plot1 <- ggplot(data = lldata2,aes(x=xvar,y=yvar))
plot1 <- plot1 + geom_point()
plot1 <- plot1 + labs(title="Plot of present count by distance from the Christmas tree.",
                      x ="Distance", y = "Presents")
plot1
```

<img src="index_files/figure-html/unnamed-chunk-20-1.png" width="70%" />
]
]

---

.pull-left[
<!-- We'll need to estimate some reasonable starting parameters. -->
<!-- This can be a guessing game, but usually there are a few shortcuts. -->
<!-- For example, if we log transform our data and perform a linear analysis then we can get reasonable starting values. -->
<!-- Other models have similar tricks. -->
<!-- If you know what your parameters represent, then you can often make good guesses from a graph. -->
<!-- Now, remember that a model using the log-transformed values is not what we actually want.
-->
<!-- We're just after starting parameter estimates, and this should get us into the right ballpark. -->
<!-- By log-transforming the observations we're trying to approximate the log-transformed response surface. -->
<!-- `$$\log{y_i}=\log{a} + bx \approx \log{\hat{y_i}}=\log{a} +bx$$` -->
<!-- These equations ARE NOT equal, but maybe we can get good estimates of the parameters. -->
<!-- Notice that we have `\(\log{a}\)`, not `\(a\)`. -->

<img src="index_files/figure-html/unnamed-chunk-25-1.png" width="90%" />
]

.pull-right[
Looks neat! How does it compare to a real `glm()` fit?

.tiny[

```r
fit_test <- glm(yvar~xvar,family=poisson,data=lldata2)
coef(fit_test)
```

```
## (Intercept)        xvar 
##   3.9521029  -0.9092647
```
]

Not bad! It looks like if we ran enough simulations we'd get about the same answer.

Note that:

```r
exp(3.9521029)
```

```
## [1] 52.0447
```
]

---

# How's it look in 3D?

```r
plot_ly(x=search_grid$a, y=search_grid$b, z=search_grid$loglik,
        type="scatter3d", color = search_grid$loglik)
```
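The `search_grid` object passed to `plot_ly()` is never constructed in these slides. Here is one hypothetical way it could be built: a brute-force grid search over `a` and `b` in `\(\lambda_i = a\cdot e^{b x_i}\)`, scoring each pair by its Poisson log-likelihood. The grid ranges and step sizes are my own guesses, and `lldata2` is re-created in base R so the chunk runs standalone (in the deck it already exists from the simulation slide):

```r
# Re-create lldata2 in base R (same seed and model as the earlier slide)
set.seed(90210)
xvar <- seq(0, 5, 0.1)
lldata2 <- data.frame(xvar = xvar,
                      yvar = rpois(length(xvar), lambda = 50 * exp(-0.9 * xvar)))

# All (a, b) combinations on the grid
search_grid <- expand.grid(a = seq(40, 60, by = 0.5),
                           b = seq(-1.2, -0.6, by = 0.01))

# Poisson log-likelihood of the data at each grid point
search_grid$loglik <- apply(search_grid, 1, function(p) {
  lam <- p[["a"]] * exp(p[["b"]] * lldata2$xvar)
  sum(dpois(lldata2$yvar, lambda = lam, log = TRUE))
})

# The best grid point should sit near the glm() estimates on the previous slide
search_grid[which.max(search_grid$loglik), ]
```

The `a`, `b`, and `loglik` columns are exactly what the `plot_ly()` call above expects.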
---

# Prediction example: Random forest